getwd()
## [1] "C:/Users/Nirmal/Documents/Python Scripts"
setwd("C:/Users/Nirmal/Documents/Python Scripts")
getwd()
## [1] "C:/Users/Nirmal/Documents/Python Scripts"
data=read.csv("HousingData.csv")
sum(is.na(data))
## [1] 120
is.na(data)
##         CRIM    ZN INDUS  CHAS   NOX    RM   AGE   DIS   RAD   TAX PTRATIO     B LSTAT  MEDV
##   [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##   [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##   [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##   [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##   [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE  TRUE FALSE
##   [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##   [7,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##   [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##   [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [10,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [15,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [16,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [17,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [20,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [21,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [23,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [28,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [29,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [30,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [32,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [33,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [34,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [35,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [36,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE  TRUE FALSE
##  [37,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [38,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [39,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [40,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [41,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [42,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [43,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [44,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [45,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [46,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [47,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [48,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [49,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [50,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [51,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [52,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [53,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [54,]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [55,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [56,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [57,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [58,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [59,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [60,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [61,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [62,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [63,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [64,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [65,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [66,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [67,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [68,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [69,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [70,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [71,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE FALSE
##  [ reached getOption("max.print") -- omitted 435 rows ]
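# The full logical matrix above is hard to scan. As a sketch (not part of the
# original analysis), colSums(is.na(...)) gives one NA count per column.
# `toy_na` is a hypothetical stand-in for `data`, so this snippet runs on its own.
toy_na <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9))
na_per_column <- colSums(is.na(toy_na))
na_per_column  # a = 1, b = 2, c = 0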
str(data)
## 'data.frame':    506 obs. of  14 variables:
##  $ CRIM   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ ZN     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ INDUS  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ CHAS   : int  0 0 0 0 0 0 NA 0 0 NA ...
##  $ NOX    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ RM     : num  6.58 6.42 7.18 7 7.15 ...
##  $ AGE    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ DIS    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ RAD    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ TAX    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ PTRATIO: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ B      : num  397 397 393 395 397 ...
##  $ LSTAT  : num  4.98 9.14 4.03 2.94 NA ...
##  $ MEDV   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
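# Fill missing values with the column mean (na.rm = TRUE ignores the NAs when averaging)
data$CRIM[is.na(data$CRIM)]= mean(data$CRIM, na.rm = TRUE)
data$AGE[is.na(data$AGE)]= mean(data$AGE, na.rm = TRUE)
data$INDUS[is.na(data$INDUS)]= mean(data$INDUS, na.rm = TRUE)
data$LSTAT[is.na(data$LSTAT)]= mean(data$LSTAT, na.rm = TRUE)
# CHAS is a 0/1 indicator, so its missing values are set to 0; ZN is left as-is
data$CHAS[is.na(data$CHAS)]= 0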
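# The mean-imputation lines above repeat one pattern. A minimal sketch of the
# same idea as a helper; `impute_mean` and `toy_imp` are hypothetical names that
# do not appear in the original script.
impute_mean <- function(df, cols) {
  for (col in cols) {
    missing <- is.na(df[[col]])
    # Replace NAs with the mean of the observed values in that column
    df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
  }
  df
}
toy_imp <- data.frame(u = c(1, NA, 3), v = c(10, 20, NA))
toy_imp <- impute_mean(toy_imp, c("u", "v"))
toy_imp  # u becomes 1, 2, 3 and v becomes 10, 20, 15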
cor(data)
##                CRIM ZN       INDUS         CHAS         NOX         RM         AGE         DIS          RAD         TAX
## CRIM     1.00000000 NA  0.39116137 -0.053710495  0.41037672 -0.2154338  0.34493361 -0.36652274  0.608886320  0.56652782
## ZN               NA  1          NA           NA          NA         NA          NA          NA           NA          NA
## INDUS    0.39116137 NA  1.00000000  0.054172460  0.74096466 -0.3814574  0.61459225 -0.69963912  0.593176456  0.71606232
## CHAS    -0.05371049 NA  0.05417246  1.000000000  0.07086746  0.1067974  0.07354903 -0.09231841 -0.003339387 -0.03582225
## NOX      0.41037672 NA  0.74096466  0.070867463  1.00000000 -0.3021882  0.71146138 -0.76923011  0.611440563  0.66802320
## RM      -0.21543377 NA -0.38145737  0.106797424 -0.30218819  1.0000000 -0.24135070  0.20524621 -0.209846668 -0.29204783
## AGE      0.34493361 NA  0.61459225  0.073549029  0.71146138 -0.2413507  1.00000000 -0.72435308  0.449988663  0.50058938
## DIS     -0.36652274 NA -0.69963912 -0.092318410 -0.76923011  0.2052462 -0.72435308  1.00000000 -0.494587930 -0.53443158
## RAD      0.60888632 NA  0.59317646 -0.003339387  0.61144056 -0.2098467  0.44998866 -0.49458793  1.000000000  0.91022819
## TAX      0.56652782 NA  0.71606232 -0.035822250  0.66802320 -0.2920478  0.50058938 -0.53443158  0.910228189  1.00000000
## PTRATIO  0.27338389 NA  0.38480592 -0.109451496  0.18893268 -0.3555015  0.26272340 -0.23247054  0.464741179  0.46085304
## B       -0.37016342 NA -0.35459662  0.050607567 -0.38005064  0.1280686 -0.26528227  0.29151167 -0.444412816 -0.44180801
## LSTAT    0.43404449 NA  0.56735384 -0.047807594  0.57237922 -0.6029620  0.57489289 -0.48342926  0.468439666  0.52454474
## MEDV    -0.37969547 NA -0.47865733  0.183844439 -0.42732077  0.6953599 -0.38022344  0.24992873 -0.381626231 -0.46853593
##            PTRATIO           B       LSTAT       MEDV
## CRIM     0.2733839 -0.37016342  0.43404449 -0.3796955
## ZN              NA          NA          NA         NA
## INDUS    0.3848059 -0.35459662  0.56735384 -0.4786573
## CHAS    -0.1094515  0.05060757 -0.04780759  0.1838444
## NOX      0.1889327 -0.38005064  0.57237922 -0.4273208
## RM      -0.3555015  0.12806864 -0.60296205  0.6953599
## AGE      0.2627234 -0.26528227  0.57489289 -0.3802234
## DIS     -0.2324705  0.29151167 -0.48342926  0.2499287
## RAD      0.4647412 -0.44441282  0.46843967 -0.3816262
## TAX      0.4608530 -0.44180801  0.52454474 -0.4685359
## PTRATIO  1.0000000 -0.17738330  0.37334313 -0.5077867
## B       -0.1773833  1.00000000 -0.36888621  0.3334608
## LSTAT    0.3733431 -0.36888621  1.00000000 -0.7219746
## MEDV    -0.5077867  0.33346082 -0.72197464  1.0000000
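# ZN shows NA throughout the matrix above because it was not imputed and cor()
# returns NA for any pair involving missing values. One option (a sketch, not
# what the original script does) is the `use` argument of cor(). `toy_cor` is a
# hypothetical data frame.
toy_cor <- data.frame(p = c(1, 2, 3, 4, 5), q = c(2, 4, NA, 8, 10))
cor_default <- cor(toy_cor)                                  # p-q entry is NA
cor_pairwise <- cor(toy_cor, use = "pairwise.complete.obs")  # uses rows where both values exist
cor_pairwise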
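# CHAS and RAD are numeric here, so each enters the model as a single linear term (1 Df each)
anova_fit=aov(MEDV~(CHAS+RAD), data=data)
summary(anova_fit)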
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## CHAS          1   1444    1444   20.71 6.72e-06 ***
## RAD           1   6201    6201   88.94  < 2e-16 ***
## Residuals   503  35071      70                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
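# In the table above CHAS and RAD each use 1 Df because aov() treats numeric
# predictors as linear terms. If the aim were to compare group means, the
# predictor could be wrapped in factor(). A sketch on hypothetical toy data
# (`toy_anova`), not a rerun of the analysis above:
toy_anova <- data.frame(
  value = c(10, 12, 11, 20, 22, 21, 30, 29, 31),
  group = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)
fit_numeric <- aov(value ~ group, data = toy_anova)          # one slope: 1 Df
fit_factor  <- aov(value ~ factor(group), data = toy_anova)  # three levels: 2 Df
df_numeric <- summary(fit_numeric)[[1]]$Df[1]
df_factor  <- summary(fit_factor)[[1]]$Df[1]
c(numeric = df_numeric, factor = df_factor)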
# Drop ZN, CHAS, DIS and B (columns 2, 4, 8 and 12)
data=data[,-c(2,4,8,12)]
sum(is.na(data))
## [1] 0
library(plotly)
fig1= plot_ly(data, x = ~CRIM, y = ~MEDV,
              type= 'scatter', mode="markers"
)
fig1= fig1 %>% layout(title= "Crime Rate vs Median Value")
fig1
fig3= plot_ly(data, x = ~RM, y = ~MEDV,
              type="scatter", mode="markers"
)%>%layout(title="RM vs Median Value")
fig3
fig5= plot_ly(data, x = ~LSTAT, y = ~MEDV,
              type="scatter", mode="markers"
)%>%layout(title="LSTAT vs Median Value")
fig5
fig6= plot_ly(data, x = ~PTRATIO, y = ~MEDV,
              type="scatter", mode="markers"
)%>%layout(title="PTRATIO vs Median Value")
fig6
fig7= plot_ly(data, x = ~TAX, y = ~MEDV,
              type="scatter", mode="markers"
)%>%layout(title="TAX vs Median Value")
fig7
fig8= plot_ly(data, x = ~INDUS, y = ~MEDV,
              type="scatter", mode="markers"
)%>%layout(title="INDUS vs Median Value")
fig8
fig9= plot_ly(data, x = ~NOX, y = ~MEDV,
              type="scatter", mode="markers"
)%>%layout(title="NOX vs Median Value")
fig9
hist(data$CRIM, col="red", main="Histogram of Crime Rate in Boston",
     xlab="Crime Rate", ylab="Frequency")

hist(data$TAX, col="red", main="Histogram of TAX rate in Boston",
     xlab="TAX", ylab="Frequency")

hist(data$RM, col="blue", main="Histogram of Rooms Per House in Boston",
     xlab="Rooms per house", ylab="Frequency")

hist(data$LSTAT, col="black", main="Histogram of Socioeconomic Status in Boston",
     xlab="LSTAT", ylab="Frequency")

hist(data$PTRATIO, col="green", main="Histogram of PTRATIO in Boston",
     xlab="PTRATIO", ylab="Frequency")

hist(data$INDUS, col="green", main="Histogram of INDUS in Boston",
     xlab="INDUS", ylab="Frequency")

hist(data$NOX, col="orange", main="Histogram of Nitric Oxide Concentration In Boston",
     xlab="NOX", ylab="Frequency")

hist(data$MEDV, col="pink", main="Histogram of Home Prices in Boston",
     xlab="MEDV", ylab="Frequency")
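# The conclusions at the end of this script describe several variables as
# positively skewed. A sketch of how that could be checked numerically rather
# than by eye; `sample_skewness` and `toy_skew` are hypothetical names, and the
# formula is the simple moment-based estimate m3 / m2^(3/2).
sample_skewness <- function(v) {
  v <- v[!is.na(v)]
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}
toy_skew <- c(1, 1, 2, 2, 3, 10)  # long right tail
sample_skewness(toy_skew)          # positive for a right-skewed sample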

library(xgboost)
library(gbm)
library(caret)
library(caTools)
library(dplyr)
y= data$MEDV
x= data%>%select(-MEDV)
# set.seed() is an R function, not an xgboost parameter, so it is called separately
set.seed(1502)
params= list(eval_metric="rmse", objective="reg:squarederror")
model_xgboost= xgboost(data=as.matrix(x), label=y, params=params, nrounds = 2, verbose=2)
## [1]  train-rmse:17.072899 
## [2]  train-rmse:12.303709
x$PMEDV= predict(model_xgboost, data.matrix(x))
cor(x$PMEDV,y)
## [1] 0.9392105
set.seed(123)
sample= sample.split(data$MEDV, SplitRatio = 0.70)
trainset= subset(data, sample==TRUE)
testset= subset(data, sample==FALSE)
model_gbm= gbm(MEDV~., data= trainset, distribution="gaussian", cv.folds= 20, shrinkage= 0.01,
               n.minobsinnode=10, n.trees= 500)
testset$PMEDV= predict.gbm(model_gbm, testset)
## Using 500 trees...
cor(testset$PMEDV,testset$MEDV)
## [1] 0.766734
model_lm=lm(MEDV~., data=trainset)
testset$P2MEDV= predict(model_lm, testset)
cor(testset$P2MEDV, testset$MEDV)
## [1] 0.6987159
library(MLmetrics)

MSE(y_pred= x$PMEDV, y_true= y)
## [1] 151.3812
MSE(y_pred= testset$PMEDV, y_true= testset$MEDV)
## [1] 28.634
MSE(y_pred= testset$P2MEDV, y_true= testset$MEDV)
## [1] 38.22198
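# MSE() from MLmetrics is the mean of the squared errors; its square root is the
# RMSE, which is in the same units as MEDV. For instance, the final XGBoost
# train-rmse of 12.303709 printed earlier squares to about 151.38, matching the
# first MSE above. A base-R sketch with hypothetical helper names
# (`mse_manual`, `rmse_manual`) and toy vectors:
mse_manual <- function(y_pred, y_true) mean((y_true - y_pred)^2)
rmse_manual <- function(y_pred, y_true) sqrt(mse_manual(y_pred, y_true))
toy_true <- c(3, 5, 7)
toy_pred <- c(2, 5, 9)
mse_manual(toy_pred, toy_true)   # (1 + 0 + 4) / 3
rmse_manual(toy_pred, toy_true)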
# From the scatter plots and histograms above, the following conclusions can be drawn:
  
# 1. Lower crime rates in a town are associated with higher property values. Crime rates in Boston appear to be low: the histogram is positively skewed, which means that most towns have low crime rates and only a few towns have high ones.

# 2. The number of rooms is another factor in home prices in Boston: houses with more rooms tend to be valued higher, and houses with fewer rooms tend to have lower property values. Rooms per dwelling is approximately normally distributed.

# 3. LSTAT (the percentage of lower-status population) is positively skewed, which indicates that most towns have a small to moderate share of lower-status residents, with just a few towns where that share is high. In the scatter plot, median home value falls as LSTAT rises (correlation of about -0.72). Some higher-income earners may still choose cheaper properties because they prioritize other aspects of their lifestyle, such as travel, food, and outings, over big luxury homes, or because they have better financial planning and investment knowledge, but this data set cannot confirm that.

# 4. Class size also affects housing prices in Boston. PTRATIO is the pupil-teacher ratio, and buyers are attracted to towns where it is low, because smaller classes generally mean more educational resources per pupil. Smaller classes tend to perform better in exams than larger ones because the teacher can give each pupil more focus and attention. House prices therefore tend to be somewhat higher in these towns.

# 5. Higher tax rates are associated with lower property values, as seen in the scatter plot. Only a few towns command high house prices, and these are towns where the tax rate is low. This may be because homeowners value a property more when they have to pay less tax on it.

# 6. People prefer to live in towns with retail businesses rather than non-retail businesses because they don't have to travel long distances for clothes, groceries, and other shopping. In the scatter plot, a higher INDUS value (more non-retail business) goes with lower home prices, which suggests that people are reluctant to live in such towns.

# 7. Finally, nitric oxide concentration (NOX) is also useful in assessing home prices. Higher NOX signals more air pollution, and people generally avoid more polluted places. Home prices are high in towns where NOX is very low: lower NOX indicates less air pollution, so living there seems more peaceful and cleaner. NOX is positively skewed, which shows that most towns in Boston have low NOX concentrations and are therefore less affected by air pollution.

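# 8. Lastly, we used three algorithms, XGBoost, gradient boosting (gbm), and linear regression, to predict the median home value in Boston, and then compared their accuracy. On the held-out test set, gbm (MSE 28.63, correlation 0.77) outperformed linear regression (MSE 38.22, correlation 0.70). The XGBoost figures (MSE 151.38, correlation 0.94) were computed on the same data the model was trained on, with only two boosting rounds, so they are not directly comparable with the other two.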